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Evolutionary imprint of activation: The design principles of VSDs 

Eugene Palovcak, Lucie Delemotte, Michael L. Klein, and Vincenzo Carnevale 

Institute for Computational Molecular Science, Temple University, Philadelphia, PA 19122 

Voltage-sensor domains (VSDs) are modular biomolecular machines that transduce electrical signals in cells 
through a highly conserved activation mechanism. Here, we investigate sequence-function relationships in VSDs 
with approaches from information theory and probabilistic modeling. Specifically, we collect over 6,600 unique 
VSD sequences from diverse, long-diverged phylogenetic lineages and relate the statistical properties of this en- 
semble to functional constraints imposed by evolution. The VSD is a helical bundle with helices labeled S1-S4. 
Surrounding conserved VSD residues such as the countercharges and the S2 phenylalanine, we discover sparse 
networks of coevolving residues. Additional networks are found lining the VSD lumen, tuning the local hydrophi- 
licity. Notably, state-dependent contacts and the absence of coevolution between S4 and the rest of the bundle are 
imprints of the activation mechanism on the VSD sequence ensemble. These design principles rationalize existing 
experimental results and generate testable hypotheses. 
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INTRODUCTION 

First discovered in voltage-gated cation channels and 
later identified in voltage-sensitive phosphatases and 
proton channels, the voltage-sensor domain (VSD) is a 
biological molecular device that responds to changes 
in transmembrane (TM) electrical potential (Yu and 
Catterall, 2004; Murata et al., 2005; Ramsey et al., 2006) . 
Moreover, VSDs are demonstrably modular and widely 
distributed in both prokaryotic and eukaryotic cells, in- 
dicating an early origin and consequently vast evolu- 
tionary exploration of sequence space (Yu and Catterall, 
2004; Arrigoni et al., 2013) . Because of their ubiquity in 
cellular electrical signaling, mutations in VSDs give rise 
to various heritable diseases (Catterall, 2010). 

The VSD consists of a four-helix bundle (denoted as 
S1-S4) embedded in a membrane (Cuello et al., 2004; 
Schmidt et al., 2006; Sands and Sansom, 2007) . The S4 
helix contains a highly conserved motif of three to eight 
positively charged residues, referred to as "gating charges," 
which occur at three-residue intervals and occupy a 
common helical face (Liman et al., 1991; Aggarwal and 
MacKinnon, 1996; Seoh et al., 1996). These positive 
charges are stabilized in the TM position through salt- 
bridging interactions with the negative "countercharge" 
residues of S1-S3 (Tiwari-Woodruff et al., 2000; Long 
et al., 2005; Payandeh et al., 2011). These charges exist 
in an aqueous-like environment as water molecules pro- 
trude into the VSD lumen (Freites et al., 2006; Tombola 
et al., 2007; Krepkiy et al., 2009). A cluster of bulky 
hydrophobic residues partitions the lumen into two 
disconnected intracellular and extracellular regions 



(Tao et al., 2010; Lacroix and Bezanilla, 2011; Cheng 
etal., 2013). 

In response to changes in membrane potential, S4 
translates so as to sequentially transfer the gating charges 
across the hydrophobic region (Campos et al., 2007; 
Vargas et al., 2012). Because of the partial solvation of 
the lumen, the TM electric field is focused on this hy- 
drophobic region (Treptow and Tarek, 2006; Jogini and 
Roux, 2007) . A rather small displacement of S4 is there- 
fore sufficient to achieve a complete transfer of positive 
charge across the field (Ahern and Horn, 2005) . This 
conformational change in S4 allows the VSD to act as a 
transducer of electrical signals. 

Major mechanistic features of VSDs appear well iden- 
tified (Borjesson and Elinder, 2008; Catterall, 2010). 
Even so, the protein sequences encoding VSDs in vari- 
ous families can be very different (Guda et al., 2007). 
Here, we aim to show how the functional requirements 
of voltage sensing have shaped the sequence distribu- 
tion of VSDs during evolution. After addressing the spe- 
cific problem of aligning the register of S4 helices in a 
statistically rigorous way, we construct a large multiple 
sequence alignment (MSA) of VSD sequences and search 
for patterns and regularities that reflect the evolution- 
ary "design principles" underlying voltage-sensor func- 
tion. Success in this endeavor implies that from the design 
principles we will be able to recapitulate experimental 
findings and, importantly, formulate novel, testable hy- 
potheses concerning the structural and functional prop- 
erties shared among VSDs. 
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Site-specific residue frequencies are arguably the most 
intuitive feature that can be extracted from MSAs. Be- 
cause the analysis may contain thousands of sequences 
from long-diverged organisms, a frequency distribution 
of amino acids can be extracted for each column/posi- 
tion of the MSA. The peculiar nature of this distribu- 
tion is, in general, informative about the evolutionary 
pressure acting on that particular position (Dokholyan 
et al., 2002) . For instance, a distribution peaked around 
tyrosine, phenylalanine, and tryptophan suggests a func- 
tional requirement for aromaticity at that position. Here, 
we use this sort of single-site analysis to detect "evolu- 
tionarily important" functional sites. 

Much information can also be extracted from joint 
frequency distributions involving pairs of positions. In 
the native folded state, each residue of the polypeptide 
chain engages in residue-residue interactions. The spe- 
cific mutations allowed at a given position are then 
highly dependent on the chemical identity of neighbor- 
ing residues. As a result, frequency distributions of amino 
acids at different positions are expected to be statisti- 
cally dependent on each other. Detection of these cor- 
relations can potentially unveil the physical interactions 
defining the native structure. However, residue-residue 
correlations per se are not able to convey this informa- 
tion; in a highly connected network, such as the con- 
tacting residues in a folded protein, an extensive set of 
pairwise interactions can give rise to long-range indirect 
statistical correlations (Giraud et al., 1999). As a result, 
pairs of correlated positions are not necessarily in con- 
tact. Reconstructing the network of direct interactions 
entails the logical process of inference: within a stochas- 
tic framework (a probabilistic model) , it is assumed that 
residues interact in pairs (are direcdy coupled) in such 
away as to best reproduce the distributions of sequences 
observed in the MSA (Weigt et al., 2009). Analysis of 
these direct couplings (direct-coupling analysis [DCA]) 
is able to highlight functionally crucial residues as those 
involved in enzymatic active sites and regions of confor- 
mational change (Weigt et al., 2009; Morcos et al., 201 1; 
Hopf et al., 2012). It is important to note that these 
methodologies rely on a large, confidently aligned MSA. 

Anticipating our findings, we show how analysis of 
site-specific frequencies and direct evolutionary cou- 
plings (ECs) unveils the major sequence determinants 
of voltage sensing. The requirement for a translation of 
S4 through the electric field results in strong conserva- 
tion of the gating charges, the countercharges, and the 
S2 phenylalanine. With DCA, we identify unexpected 
networks of strongly evolutionarily coupled residues 
surrounding these conserved positions. We also identify 
sites that have been under significant evolutionary pres- 
sure and have not yet been tested for functional rele- 
vance. We suggest these sites as potential targets for new 
mutagenesis experiments. Additionally, we show that 
the helical interfaces between S1-S4 and S3-S4 do not 



feature strong ECs, suggesting that no specific residue- 
residue contacts were conserved along evolution apart 
from the canonical salt bridges. This lack of expected 
coevolution is shown to characterize the entire paddle 
motif, rationalizing chimeragenesis and deletion analy- 
sis experiments highlighting the modular, mobile nature 
of this region. Lasdy, observation of specific evolution- 
arily coupled residue pairs involving positions 96, namely 
25-96 and 49-96 (NavAb numbering) , provides inde- 
pendent support for conformational models of the VSD 
along the activation pathway and suggests that positions 
25, 49, and 96 have important roles in tuning the activa- 
tion properties of the VSD. 

MATERIALS AND METHODS 

Hidden Markov model (HMM) training: "Seed" MSA 

Profile HMMs are among the most important approaches to ana- 
lyze ensembles of sequences. Conceptually, HMMs characterize 
sequences as stochastic processes: discrete time random walks 
across a set of states with defined state and transition probabilities 
(Eddy, 1998). In practice, HMMs are initialized with a protein se- 
quence of interest and iteratively refined by scanning many mil- 
lions of known protein sequences. Homologous sequences are 
then collected in an MSA, an array of statistically significant pro- 
tein sequence alignments. From an abstract perspective, an MSA 
can be thought of as the physical record of evolution's explor- 
atory search for structurally and functionally analogous proteins. 
Subsequent sequence analysis is highly dependent on the quality 
and quantity of alignments in the MSA, and for this reason, we 
give special care to its construction. 

To train an HMM to identify and align VSD sequences from a 
large sequence database, we construct a small, confidently aligned 
"seed" MSA. In the seed MSA, we included homologues repre- 
senting phylogenetically diverse VSD-containing families (Yu and 
Catterall, 2004). Initial multiple alignments were generated using 
ClustalW2 and manually refined such that the alignment was con- 
sistent with structural alignments (in the cases where high resolution 
structures were available) and TM-helix topology predictions (Zhang 
and Skolnick, 2005; Larkin et al., 2007; Viklund and Elofsson, 
2008) . See Table SI for full seed MSA. 

VSDs contain several conserved positive gating charges, which 
come at three-residue intervals on the S4 helix. However, not all 
VSDs contain the same number of gating charges, and this may 
lead to uncertainty in the alignment of S4, as three-residue shifts 
in the alignment register will still match positive charges (Wood 
et al., 2012; Kulleperuma et al., 2013). We were interested in 
quantifying how uncertain such alignments actually were. Position- 
wise reliability of a sequence alignment can be calculated as the 
posterior probability of each symbol in the alignment being emit- 
ted by a pair HMM based on the Blosum62 substitution matrix 
(Wolfsheimer et al., 2012). As an example, for the pairwise align- 
ment of human Hvl S4 to that of human Kvl.2 S4, we calculated 
the positional posterior probability of (a) an alignment generated 
by the Needleman-Wunsch algorithm, and (b) alignments "sam- 
pled" from a pair HMM with Hvl S4 in alternate registers using 
ppAlign (Needleman and Wunsch, 1970; Wolfsheimer et al., 
2012). The posterior probability over the S4 helix in (a) is much 
higher than any of the alignments in (b), suggesting that, given 
the widely used Blosum62 substitution matrix, the alignment of 
S4 is unambiguous. 

An HMM was trained based on the seed MSA using HMMER3.0 
(Eddy, 1998). Scanning this HMM against the NCBI protein 
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sequence database, we collected and aligned 6,652 unique VSD 
sequences containing all four TM helices (Eddy, 1998) . Only align- 
ments with an E-value of <0.01 were included. Sequence logos 
were generated with Weblogo3 (Crooks et al., 2004). 

To describe the constitution of the VSD MSA, hierarchical clus- 
tering was performed to partition the MSA into protein families. 
Specifically, we clustered with the neighbor-joining algorithm 
in the phylogeny inference package (PHYLIP; version 3.695). 
Neighbor-joining is a fast tree-generating method appropriate 
for datasets of the size considered here. By identifying sequences 
in this tree with well-curated annotations from the NCBI protein 
database, branches can be assigned to known VSD-containing 
protein families (Yu and Catterall, 2004). This tree is shown in 
Fig. SI. The seven largest branches corresponded to the first 
three VSDs in voltage-gated calcium and sodium channel pseudo- 
tetramers (Navs + Cavs ITU), the fourth VSD in voltage-gated cal- 
cium and sodium channel pseudotetramers (Navs + Cavs IV), 
eukaryotic voltage-gated potassium channels (Euk. Kvs) , prokaryotic 
voltage-gated potassium channels (Prok. Kvs) , hyperpolarization- 
activated cyclic nucleotide-gated channels (HCNs) , and voltage- 
gated proton channels/voltage-sensitive enzymes (Hvs + VSEs). 
An additional set of VSD sequences did not cluster with VSD se- 
quences with well-curated annotations, and we labeled them "un- 
classified." A piechart of the VSD MSA illustrating these families 
can be found in Fig. 2 A. 

To quantify the diversity of sequences in the VSD MSA, we cal- 
culated the histogram of pairwise sequence identities for the 
aligned regions of all sequences (Fig. 2 B) . This histogram shows 
an approximately unimodal distribution peaked around 20% se- 
quence identity. Additionally, we calculated histograms of pairwise 
sequence identities for MSAs of sequences in each of the seven 
protein families described above (see Fig. SI). Several histograms 
showed polymodal distributions, suggestive of VSD subtypes within 
the identified protein families. 

Kullback-Leibler divergence 

With such a large quantity of homologous protein sequences in 
the VSD MSA, site-specific frequency distributions can be deter- 
mined quantitatively. We can think of the difference between such 
a distribution and a reference as a measure of the "evolutionary 
pressure" exerted on that site. The Kullback-Leibler divergence, 
Dio^i), measures the information that differentiates an observed 
empirical distribution of amino acids, Pj, from a suitable refer- 
ence distribution of amino acids, Qj: 

Single-site amino acid frequency distributions, Pj, were calcu- 
lated from the MSA with python scripts. Biases caused by phyloge- 
netic relationships and incomplete sampling were addressed by 
reweighing sequences in MSA; aligned sequences with high se- 
quence identity (>0.9) were weighted together as a single sequence 
as described by Morcos et al. (2011). By these criteria, the effec- 
tive number of sequences in the MSA was calculated to be 3,821. 

Reference amino acid distributions, Qj, were calculated for 
three topologically distinct regions of TM proteins: the inner 
membrane-water interface, the outer membrane-water inter- 
face, and the lipid-facing TM region. The Orientations of Proteins 
in Membrane (OPM) database consists of membrane protein 
structures with predicted outer and inner membrane interfaces. 
From the OPM, 533 polytopic a-helical TM protein structures 
were downloaded (Lomize et al. , 2006) . A python script was used to 
measure the amino acid frequencies within an 1 1-A window of each 
predicted interface (representing the inner and outer membrane- 
water interfaces) and also between these windows (representing 



the TM region) . These reference distributions and the assignment 
of topological region to individual positions in the NavAb VSD 
are available in Tables SI and S4. 

DCA 

To accurately reconstruct the network of evolutionarily important 
interactions from statistical correlations in the MSA, we infer the 
probabilistic model that makes the least possible number of as- 
sumptions about the underlying structure of the data. This criterion 
is satisfied by the model structure that maximizes the Shannon 
entropy, namely a Potts model. Additionally, determination of the 
interactions (the parameter-learning step) must rely on a compu- 
tationally tractable algorithm. Several solutions to this second 
problem have been proposed recendy (Morcos et al., 201 1; Ekeberg 
et al., 2013) . These modeling approaches have been implemented 
in a family of algorithms known as DCA. DCA has demonstrated 
unprecedented success in dissecting the correlations present in 
MSAs into scored pairwise interactions. The set of all possible 
scored pairs generates an EC score matrix. This matrix appears 
highly similar to contact maps derived from the crystal structures 
of members of the protein family and contains sufficient informa- 
tion to allow ab initio structure prediction, identification of inter- 
faces in homo-oligomers, and prediction of conformational changes 
(Morcos et al., 2011; Hopf etal., 2012). 

Using a large MSA as input, DCA infers the parameters of a 
probabilistic model that reproduces single- and two-site marginals 
of empirical distribution of sequences contained in the MSA. The 
least constrained probabilistic model, known as a Potts model, 
has the following functional form: 




P(ai,. ..,a N ) is the probability of observing the sequence {a\,...,a^\, 
where a, takes values in the alphabet containing the 20 amino 
acids plus the gap ("-") symbol. Local fields, /j,(a), give rise to the 
propensity for an amino acid, a,, to be observed at sequence posi- 
tion i, whereas the coupling constants, Py(a„ «,), similarly encode 
the joint propensity of observing amino acids a, and eij at positions 
i and j, respectively. For a given i,j pair, the set of coupling con- 
stants is arranged in a matrix; the Frobenius norm of this matrix 
is defined as the EC score for the pair The full set of EC scores 
composes the EC score matrix. EC score matrices have been 
shown to be significantly correlated with contact maps. Although 
a rational interpretation of this correlation would posit that resi- 
dues engaged in energetically favorable interactions are likely to 
coevolve, it has yet to be shown to which extent a statistical coupling 
can be identified with an energetic one. Thus, a strict correspon- 
dence between data from double mutant cycle analysis and DCA 
is not expected a priori. We obtained the EC scores of the VSD 
using Matlab scripts for pseudolikelihood maximization DCA de- 
scribed in Ekeberg et al. (2013). Molecular visualizations were 
created with visual molecular dynamics (Humphrey et al., 1996). 

Statistical inference to determine direct couplings has been 
performed using a model structure (Potts model) that assumes a 
large set of parameters; even considering only 1 15 positions from 
the VSD sequences, there are ^2.5 million parameters correspond- 
ing to pairwise coupling constants and local fields. The optimal 
values of these parameters are found, in general, by maximizing 
the likelihood of the data, here approximated bv a computation- 
ally tractable proxy, the "pseudolikelihood" (Ekeberg et al., 2013). 
The quality of the results is therefore strongly dependent on the 
size of the dataset and on a correctly stratified sampling. To test 
whether our MSA of 6,652 unique VSD sequences satisfies these 
requirements, we calculated the EC score matrices for two dis- 
joint subpopulations in our sample. Specifically, we extracted 
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MSAs from the original dataset corresponding to two distinct phy- 
logenetic lineages, the pseudotetrameric Nav/ Cav family and the 
Kv family. These subpopulations were identified based on the par- 
titioning provided by the hierarchical clustering described above, 
resulting into 3,391 and 1,832 sequences for Nav/ Cav and Kv 
VSDs, respectively. EC score matrices for both of these families are 
presented in the supplemental material for comparison with the 
EC score matrix of the full dataset (Figs. S3, A and B, and 5 C) . 

Similarity between these EC score matrices suggests that results 
obtained from the full dataset are not significantly affected by sam- 
pling biases. Furthermore, and somewhat more importantly, this 
similarity indicates that the major features of the sequence ensem- 
ble are preserved on passing from Nav/ Cav VSDs to Kv VSD ones. 

Online supplemental material 

The supplemental material contains the seed MSA used to infer 
the HMM (Table SI). Fig. SI (A-G) shows pairwise sequence iden- 
tity histograms for each of the seven protein families identified in 
our large MSA by hierarchical clustering. Fig. SI H shows the den- 
drogram produced by hierarchical clustering, with assigned 
branches highlighted and labeled. Fig. S2 (A-D) shows the con- 
tact maps of four experimentally determined VSD structures and 
a structural superposition with the NavAb VSD. Fig. S3 (A and B) 
shows EC score matrices for partitioned MSAs containing either 
Nav and Cav or Kv VSD sequences. Tables S2 and S3 provide 
Shaker numbering for positions discussed in this work. Table S4 
contains the membrane protein reference distribution. To help 
the readers map our results on each specific VSD, the MSA is 
available for download in the supplemental material, as well as a 
file with the complete set of parameters of the HMM used in this 
work The file format of the latter (.hmm) is compatible with the 
HMMER webserver (http:/ /hmmer.janelia.org/search/hmmsearch) 
and can be used to generate an MSA using a different (possibly 
the most up to date) sequence database. The online supplemen- 
tal material is available at http://www.jgp.org/cgi/content/full/ 
jgp.201311103/DCl. 



RESULTS 

The S4 dilemma 

To generate the large MSA on which our subsequent 
analysis is based, we first train a profile HMM to recog- 
nize and align VSDs based on a small "seed" alignment 
(Table SI ) . This alignment contains representative VSDs 
from phylogenetically diverse protein families such as 
the voltage-gated potassium, sodium, and calcium chan- 
nels, the voltage-gated proton channels, and the voltage- 
sensitive phosphatases (Yu and Catterall, 2004). Using 
structural alignment and TM helix topology predictions, 
we manually refine automatic MSAs for all of the repre- 
sentative sequences included and generate a confident 
seed alignment, available in the supplemental material 
and described more fully in Materials and methods. 
However, recently published homology models of the 
human voltage-gated proton channel (hHvl ) suggest that 
alignment of the S4 helix among VSDs may be ambigu- 
ous (Wood et al, 2012; Kulleperuma et al., 2013). Be- 
cause hHvl contains three gating charges and other VSDs 
contain four to six, alignments that only consider the 
matching of gating charges produce several different 
possible registers. 



Without structural alignments for all of the VSD rep- 
resentatives included in the seed MSA, we are forced to 
rely on score-based sequence alignment of S4. A statisti- 
cally well-founded metric to judge the reliability of score- 
based alignments is the positional posterior probability 
(Wolfsheimer et al., 2012). The posterior probability 
for a certain position in a pairwise alignment is calcu- 
lated by marginalizing a posterior distribution of scored 
possible alignments. Simply, the posterior probability 
that two particular positions in a pair of sequences should 
be aligned is a measure of how probable it is to see 
those particular residues aligned if the full distribution 
of possible alignments is considered. The profile of pos- 
terior probabilities for a given pairwise alignment there- 
fore shows where the alignment is probable and where 
it is uncertain. This is exactly what we would like to see 
for alternate alignments of S4. 



Needleman-Wunsch 
Alignment of Hvl S4 

1.0 




p.p. 



Kvl 2 MSLAILRVIRLVRVFRIFKL5RHSKGLQILGQT 
++ I ++ I +++ I I + I I + I I ++ I ++ I +++ I 

Hvl EALGLLILLRLURVARIIN GIIISVKT 



B 



Sampled Alignments of Hvl S4 



P.P. 




Kvl. 2 MSLAILRVIRLVRVFRIFKLSRHSKGLQILGQT 
I ++ I I ++ I ++ I ++ + I ++ I + 
Hvl — L ILLRLWRVARI I NGIIIS 



1.0 




MSLAILRVIRLVRVFRIFKL5RHSKGLQILGQT 

++ I ++ I ++++ I +++++ I +++ I + I ++ 

-FEA LGLLILLRLUIRVARIINGI-IIS — 



P.P. 

Kvl. 2 
Hvl 

P.P. 



Kvl. 2 MSL AILRVIRL VRVFRIFKLSRHSKGLQILGQT 
++ I + I I +1 ++ I ++++ ++ I I +++ 
Hvl LGLLIL LRLURVARI ING — IIIS- 

Figure 1. The register of S4 is not ambiguous. (A) Positional 
posterior probability of the Needleman-Wunsch optimal pair- 
wise alignment of Kvl. 2 and human Hvl. (B) Positional poste- 
rior probabilities of alignments sampled from the pair HMMs 
underlying the ppAlign algorithm. Each alignment represents 
a different possible register for the S4 helix of Hvl with respect 
to Kvl. 2. 
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In Fig. 1 A we show the optimal Needleman-Wunsch 
pairwise alignment of hHvl to human Kvl .2, that is, the 
pairwise alignment with the best possible score. Inciden- 
tally, the register of this alignment is consistent with that 
of Kulleperuma et al. (2013). Over the region of the gat- 
ing charges, the posterior probability is high. Taking ad- 
vantage of the probabilistic pair HMMs underlying the 
posterior probability calculation, alternate alignments 
were sampled with hHvl in different registers. Fig. 1 B 
shows that in three alternate registers of S4, the posterior 
probability of the region covering the gating charges is 
decimated. This indicates that given the fundamental as- 
sumptions of score-based pairwise sequence alignment, 
the proper register of hHvl S4 is not ambiguous. This 
method was used to check the reliability of the S4 register 
for other pairs of sequences in the seed MSA. 

Collecting and characterizing a large MSA 
of VSD sequences 

With the technical issue of aligning S4 addressed, we 
train an HMM based on the seed MSA and collect a 
larger MSA of ^6,600 VSD sequences as described in 
Materials and methods. We note that by using an HMM 
trained on a small but diverse seed MSA, we are able to 
automatically construct a much larger MSA than those 
reported previously for the VSD (Guda et al., 2007; Lee 
et al., 2009; Kulleperuma et al., 2013). 

By building an HMM with a phylogenetically diverse 
seed MSA, we expected to collect a dataset of VSDs as 
complete as possible. It is important to confirm that the 
MSA collected does not exclude or over-represent any 
of the known VSD-containing protein families, and that 
sequences in each family are sufficiently diverse to cap- 
ture the VSD sequence variability. 



The 6,652 sequences of the VSD MSA were partitioned 
into clusters representing protein families as described 
in Materials and methods. In brief, the neighbor-joining 
hierarchical clustering method was used to construct 
a dendrogram (PHYLIP). Protein family clusters were 
labeled by identifying sequences with well-curated an- 
notations in each major branch (see Fig. SI for tree 
with family assignments). To assess the sequence vari- 
ability of the dataset, pairwise sequence identities were 
calculated for all pairs of sequences in the full MSA and 
for all pairs within each identified protein family (see 
Fig. SI for protein family sequence identity histograms) . 

As observed in Fig. 2 A, the hierarchical clustering natu- 
rally breaks down the MSA into previously described pro- 
tein families including voltage-gated calcium and sodium 
channels, eukaryotic and prokaryotic voltage-gated potas- 
sium channels, prokaryotic sodium channels, hyperpo- 
larization activated cyclic nucleotide-gated channels, 
voltage-gated proton channels, and voltage-sensitive 
enzymes. Interestingly, the first three VSDs of the pseudo- 
tetrameric voltage-gated sodium and calcium channels 
clustered together, and the fourth clustered separately. 
This may be expected from the findings of Lacroix et al. 
(2013), in which the functionally distinct kinetics of the 
first three "fast" domains of Navl.4 compared with the 
fourth "slow" domain were shown to arise from specific 
sequence differences in their respective VSDs. 

Fig. 2 B shows the distribution of pairwise sequence 
identity, where we observe a single large peak around 20% 
identity. If the dataset consisted of large clusters of mostly 
identical sequences, we would expect to see additional 
large peaks at high identity values. Our MSA therefore 
conforms to our requirements for extensive phylogenetic 
coverage and sufficient intra-family variability. 




Figure 2. The sequence sample is correctly stratified and captures the VSD sequence variability. (A) Breakdown of the MSA into 
protein families. Navs + Cavs I-HI, the first three VSDs in the pseudotetrameric voltage-gated sodium and calcium channels; Navs + 
Cavs IV, the fourth VSD in the pseudotetrameric voltage-gated sodium and calcium channels; Euk. Kvs, eukaryotic voltage-gated potas- 
sium channels; Prok. Kvs, prokaryotic voltage-gated potassium channels; Prok. Navs, prokaryotic voltage-gated sodium channels; HCNs, 
hyperpolarization-activated cyclic nucleotide-gated channels; Hvs + VSEs, voltage-gated proton channels and voltage-sensitive enzymes. 
(B) Histogram of pairwise sequence identities for the MSA. 
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We also note the multimodal character of the pairwise 
sequence identity distributions for individual protein 
families (see Fig. SI). Multimodal histograms of sequence 
similarity are a clear indication of substructure in the 
data. Although our aim here is to confirm sequence di- 
versity in our dataset and not to characterize the VSD 
sequence composition in every potential subfamily, the 
presence of modes in the distribution suggests the exis- 
tence of distinct VSD subtypes within the identified pro- 
tein families and may warrant further attention. 

Single-site analysis of the VSD 

The large VSD MSA is represented as a sequence logo 
in Fig. 3 (Crooks et al., 2004), with positions numbered 
according to the Arcobacter butzleri voltage-gated sodium 
channel (NavAb) sequence (Payandeh etal., 2011). 

The height of the columns in Fig. 3 suggests that con- 
served residues on the VSD occur with periodicity. This 
trend was noticed in an early study done with a smaller 
MSA of VSD sequences (Guda et al., 2007). To quantify 
this observation, we calculate the Kullback-Leibler di- 
vergence (Dkl) for each position in the MSA, as de- 
scribed in Materials and methods. In general, Dkl values 
quantify the additional amount of information (in nats) 
differentiating the empirical distribution of amino acids 
at a certain position from a suitable reference distribu- 
tion (Halabi et al., 2009) . For instance, residues located 
in the TM region are expected to be mostly hydrophobic; 
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Figure 3. Sequence logos of the VSD. Black, GAVILMWF; green, 
TSNQY; blue, RKH; red, DE; purple, GP. Residue numbering cor- 
responds to homologous positions in NavAb. 



a TM position with a distribution significantly enriched 
in hydrophilic residues arguably suggests evolutionary 
pressure and results in a large value of Dkl. Therefore, 
we use Dkl as an empirical site-specific measure of evo- 
lutionary pressure. 

In Fig. 4 A, we map high Dkl residues above a thresh- 
old (1.1 nats) onto the NavAb VSD and notice that the 
internal core of the bundle has the highest Dkl- The 
outer surface of the VSD has much lower Dkl, consis- 
tent with the idea that these positions make nonspecific 
interactions with membrane lipid tails and are thus sub- 
ject to less strict evolutionary constraints (Cuello et al., 
2004; Schmidt et al., 2006). Interestingly, the top of S3 
(termed "S3b," consisting of positions 86-91) does not 
exhibit this periodicity (Fig. 4 D) , suggesting a lack of 
evolutionary pressure on this region. Fig. 4 D shows the 
relative variation of Dkl in the VSD. 

Unsurprisingly, the S4 gating charges (Fig. 4 C, posi- 
tions 99, 102, 105, and 108) and the S1-S3 counter- 
charges (Fig. 4, B and C, positions 32, 59, and 80) exhibit 
high Dkl with respect to the rest of the VSD. This is con- 
sistent with their conspicuous placement as charged 
particles in the TM region and with their conserved 
mechanistic importance among VSDs (Aggarwal and 
MacKinnon, 1996; Seoh et al., 1996; Tiwari-Woodruff 
et al., 2000). Extensive mutagenesis studies have identi- 
fied an SI isoleucine (Fig. 4 B, position 22) and an S2 
phenylalanine (Fig. 4 B, position 33) as the primary 
constituents of a conserved hydrophobic region pres- 
ent in VSDs (Tao et al., 2010; Lacroix and Bezanilla, 
2011, 2012). Mutation of residues in this region modu- 
lates both the kinetics and voltage sensitivity of S4 trans- 
lation. Both of these positions have high Dkl (Fig. 4 D) . 
Interestingly, voltage-gated proton channels are not sig- 
nificantly represented in the dataset; the position iden- 
tified as the selectivity filter in hHvl (Fig. 4 B, position 
25 and hHvl D112) shows high Dkl (43), suggesting a 
functional significance for this position, even in VSDs 
that do not conduct protons. 

Given that Dkl measures evolutionary pressure and 
that identified functional residues exhibit high Dkl, it 
may be valuable to investigate other high Dkl positions 
throughout the VSD. To the best of our knowledge, resi- 
dues at the lower membrane interface (Fig. 4, B and C, 
positions 11, 14, 63, 71, 74, 76, and 77) have not been 
experimentally characterized. From Fig. 3 we see that 
these positions take polar, aromatic, and positively 
charged amino acids. These amino acids are commonly 
observed at the membrane interface, specifically at the 
inner leaflet, as they are involved in anchoring the pro- 
tein to membrane headgroups (von Heijne and Gavel, 
1988; Yau et al., 1998). However, in VSDs their relative 
abundance cannot be explained only by their localiza- 
tion. Indeed, our formulation of Dkl explicidy accounts 
for the propensity for observing these residues at the 
lower membrane interface. That the presence of these 
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amino acids has been maintained throughout evolution 
is suggestive of distinct, conserved functional roles. 

DCA of the VSD 

We performed DCA on the VSD and ranked pairs accord- 
ing to their EC scores (see Materials and methods for 
the definition of EC scores) . Simply, the EC score for a 
pair of residues is a heuristic measure of the total cou- 
pling as inferred from the probabilistic modeling. It is 
important to note that the distribution of EC scores does 
not show an obvious threshold of significance. Fig. 5 E 
shows a histogram of EC scores for non-neighbor pairs 
(more than five residues apart in sequence) . The distri- 
bution appears to be approximately Gaussian with a fat 
right tail. Because we have no prior expectation about 
the distribution of EC scores, we set an arbitrary thresh- 
old of 0.10 (approximately corresponding to a standard 
deviation from the mean), as depicted by the yellow re- 
gion in Fig. 5 E. Fig. 5 C only considers those couplings 
with EC scores >0.10. Considering only positions sepa- 
rated by at least five residues in the primary structure, 
we find that 20 of the top 23 pairs are in direct physical 
contact in the crystal structure of the activated NavAb 
VSD (Fig. 5 A). This gives a true-positive rate of 87% 
contact prediction for this first small subset of pairs. 



This is consistent with reported true-positive rates for 
contact prediction with pseudolikelihood maximization 
DCA (Ekeberg et al., 2013). These 20 contacting pairs 
define the interfaces between S1-S2 and S2-S3. Curi- 
ously, only two pairs in the top 23 involve S4 coupled to 
the S1-S3 bundle, and these pairs are not in physical 
contact in the NavAb VSD structure (Fig. 3, positions 
25-96 and 49-96) . For comparison with Fig. 5 C, con- 
tact maps of four other experimentally determined VSD 
structures are available in Fig. S2, accompanied by 
structural superpositions with the NavAb VSD. 

To further investigate the spatial distribution of pairs 
with large EC scores on the VSD structure, we compare 
the EC score matrix to the contact map of the NavAb 
VSD (Fig. 5, B and C). The trend is similar to that ob- 
served in the top pairs: the contacting interfaces of 
S1-S2 and S2-S3 are well-defined in Fig. 5 C, whereas 
the contacting interfaces of S1-S4 and S3-S4 observed 
in the contact map are completely absent in the EC 
score matrix (shown schematically in Fig. 5 D) . EC score 
matrices calculated from partitioned MSAs of only 
Cav/Nav or Kv VSDs show similar features (Fig. S3, A 
and B) . In the next section we consider how design 
principles of VSD function may have constrained evolu- 
tion to generate these observations. 
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Figure 4. Single-site analysis of the VSD. (A) Cartoon representation of the NavAb VSD. Residues with a Dkl score greater than the 
1.1-nat threshold are represented as a contiguous blue surface. Residues with a Dkl score below the threshold are transparent. (B and C) 
Helices S1-S4 of the NavAb VSD. Residues with a Dkl score greater than the 1.1-nat threshold are represented as licorice and colored as 
in Fig. 1. (D) Dkl score by NavAb residue position with residue positions above the 1.1-Dkl threshold represented as blue bars. 
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DISCUSSION 

In this study, we have constructed a large, diverse MSA of 
VSD sequences and identified patterns with two different 
but complementary sequence analysis approaches. By 
calculating the Dkl of each position in the alignment, we 
identified a set of evolutionarily important, core-facing 
VSD positions (Fig. 3, B and C) . Some of these positions 
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Figure 5. DCAof the VSD. (A) Cartoon representation of NavAb 
VSD. Top 20 evolutionarily coupled contacting nonlocal residue 
pairs (more than five residues apart in the primary structure). 
C-a atoms are labeled and colored as in Fig. 3. Several top-scoring 
evolutionarily coupled pairs were excluded, as described in Re- 
sults. Helices are colored as follows: blue, SI; red, S2; white, S3; 
yellow, S4. (B) Contact map for the NavAb VSD. A contact is de- 
fined between two residues if the respective C-|3 atoms (C-a in the 
case of glycine) are within 8 A. (C) EC map for the VSD MSA. 
A few couplings were much higher than the 0.25 maximum shown 
on the colorbar. (D) Schematic showing EC topology of VSD. 
(E) Histogram of the EC scores for nonlocal residue pairs. All 
pairs shown in C fall in the yellow region. 



are well-known determinants of voltage sensitivity, whereas 
others remain unexplored (Seoh et al., 1996; Tao et al., 
2010; Lacroix and Bezanilla, 2012). We then used DCA 
to distinguish pairs of evolutionarily coupled residues. 
The distribution of strongly coupled pairs is inhomoge- 
neous throughout the VSD, suggesting localized regions 
where specific evolutionary tuning has occurred. Espe- 
cially strong EC is observed at the S1-S2 and S2-S3 
interfaces, whereas the contacting S1-S4 and S3-S4 hy- 
drophobic interfaces show no signal of coevolution. Con- 
sidering the results, we now attempt to characterize the 
design principles followed by evolution to shape the ob- 
served sequence ensemble of VSDs. 

Coevolving, state-dependent residue-residue contacts 
We find that ECs are informative of the conformational 
transition of S4 upon VSD activation. Specifically, we 
identified two evolutionarily coupled residue pairs, N49- 
E96 and N25-E96, which are not in contact in the acti- 
vated NavAb crystal structure. Several independent lines 
of evidence show a physical interaction between N49- 
E96 positions in the resting state of several VSD homo- 
logues (DeCaen et al., 2009; Wu et al., 2010). Recent 
models of NavAb and Kvl.2 deactivation show both 
pairs in contact in a resting state (Fig. 6, D, N49-E96 and 
N25-E96, and E, R287-E183) (Bjelkmar et al., 2009; 
Amaraletal., 2012). 

Mutations in position 49 have been shown to have 
dramatic effects on the activation threshold in several 
VSDs (Papazian et al., 1995; Planells-Cases et al., 1995; 
Gamal El-Din et al., 2013) . Pless et al. (201 1 ) found that 
of the three countercharges in the Shaker K + channel 
(Fig. 3, positions 49, 59, and 80; see Tables S2 and S3 for 
correspondence to Shaker numbering), alteration of 
the charge at position 49 had the greatest effect on the 
activation threshold. This effect can be understood as a 
modulation of the interaction of position 49 with the 
gating charges, selectively stabilizing specific states along 
the activation pathway (DeCaen et al., 2009; Delemotte 
et al., 2011; Amaral et al., 2012; Henrion et al., 2012; 
Jensen et al., 2012). We anticipated a coevolution signal 
between position 49 and all gating charge positions that 
exhibit some variability (namely, positions 96, 105, and 
108). Surprisingly, we found coevolution only between 
49 and 96, and this signal is very strong. 

Because position 49 and 96 have been shown to inter- 
act in resting states of the VSD and mutation of 49 has 
been shown to stabilize the resting state, we hypothesize 
that the coevolution between 49 and 96 is the signature 
of an evolutionary tuning of the resting-state stability. 

Position 25 is proximal to 49 in the NavAb VSD (Fig. 6 C) , 
exhibits a qualitatively similar distribution of polar and 
acidic amino acids (Fig. 3), and also coevolves with 96 
(EC score of 0.196). Although position 25 has been 
shown to determine proton selectivity in voltage-gated 
proton channels, we are not aware of any studies that 
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probe the effects of mutating 25 on VSD activation 
threshold (Musset et al., 201 1) . In analogy to position 49, 
we hypothesize that it may serve a similar functional role 
in tuning the relative stability of VSD activation states. 

A lack of expected S4 coevolution 

Previous reports describing DCA have made strong par- 
allels between the EC score matrix and the contact map 
(Weigt etal., 2009; Morcos et al., 2011; Hopf etal., 2012; 
Ekeberg et al., 2013). We were therefore surprised that 
the S1-S4 and S3-S4 contacting interfaces were absent 
in the EC score matrix. However, chimeragenesis and 
deletion analysis experiments suggest that the S3b-S4 
paddle is modular (Alabi et al., 2007; Bosmans et al., 
2008; Mishina et al., 2012; Kalia and Swartz, 2013; 
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Figure 6. Coevolving residue-residue contacts. (A) Close-up 
of the coevolving "hydrophobic nest" that surrounds conserved 
hydrophobic residues F56 and 122. (B) Close-up of the coevolv- 
ing "cage" that surrounds F71. (C) Coevolving network of polar 
and apolar residues at the extracellular mouth of the VSD. (D) Com- 
putationally modeled activated and resting states of the NavAb 
VSD, as described in Amaral et al. (2012). Residues of interest 
are labeled. (E) Computationally modeled activated and rest- 
ing (C3) states of the Kvl.2 VSD, as described in Bjelkmar et al. 
(2009). 



Xu et al., 2013). This modularity implies that as long as 
conserved features of S4 are maintained among chime- 
ras (such as the gating charges and the hydrophobic 
faces of the helix), specific coevolution between con- 
tacting residue-residue pairs is not required for basic 
voltage-sensor function. This would be consistent with 
a free-energy landscape of activation dominated by 
coupled interactions between gating charges and coun- 
tercharges, which do exhibit functionally relevant co- 
evolution in at least one case (discussed in detail in the 
previous section) . Although a family-specific tuning of 
the complementarity between the S1-S4 and S3-S4 in- 
terfaces cannot be excluded, the lack of specific coevo- 
lution between these interfaces suggests that this is not 
a necessary feature of VSDs. 

Hydrophilic tuning of the VSD lumen 

Because the gating charges have to slide into and out of 
the VSD, the extracellular mouth of the bundle must be 
solvated. We hypothesize that the chemical character of 
this "water crevice" depends on a subtle balance between 
hydrophilic and hydrophobic residues. This is high- 
lighted by the fact that mutations in this region dramati- 
cally change the gating kinetics (Lacroix et al., 2013). 
Compensatory mutations could help maintain a specific 
hydrophilic character. Coevolution between several pairs 
of residues supports this hypothesis (Fig. 6 C) . 

On the opposite, intracellular side of the VSD, R63 
and S77 are also observed to be evolutionarily coupled. 
A stabilizing pair of hydrophilic residues at the intracel- 
lular mouth may facilitate the solvation of this side of 
the VSD. Additionally, W76 and T15 coevolve but are 
not in physical proximity in the activated NavAb struc- 
ture, lying on opposite sides of position 117 in the acti- 
vated states of NavAb and Kvl.2 (Fig. 6, D and E, 
"Activated" states, and Table S3). In modeled resting 
states of both NavAb and Kvl.2, another positive charge 
sits between these two positions (Fig. 6, D and E, "Rest- 
ing" states). The Dkl of 76 is very high, and 15 is right 
beneath the chosen Dkl threshold. We propose that co- 
evolution between these residues tunes the energetics 
of the S4 transition. 

A hydrophobic nest 

Although the importance of the conserved gating and 
countercharges was recognized early in studies of volt- 
age-gated ion channels, the role of the hydrophobic re- 
gion was recognized much more recendy (Tao et al., 
2010; Lacroix and Bezanilla, 2011). The experimentally 
characterized constituents of the hydrophobic region 
include an SI isoleucine and an S2 phenylalanine (Fig. 3, 
122 and F56) . Mechanistically, this region hinders the 
diffusion of ions and water through the VSD and acts as 
an energetic barrier for the movement of S4 (Schwaiger 
et al., 2013). An evolutionarily coupled network of resi- 
dues surrounds 122 and F56 (Fig. 6 A) . We propose that 
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these residues form a "hydrophobic nest" that constrains 
the geometry in this region. Shape complementarity of 
residues in the hydrophobic nest is maintained by com- 
pensatory mutations, which are in turn detected as ECs 
by DCA. 

A bulky cage 

A similar feature is observed at the base of the SI and S2 
helices. Single-site residue frequency analysis highlights 
F7l as a potentially functionally important residue. This 
hypothesis is supported by the presence of a highly con- 
nected network of couplings surrounding F7l (Fig. 6 B) . 
Residues in this cluster are generally bulky hydrophobic 
residues. We speculate that F71 and its network of co- 
evolving bulky neighbors define a "cage" that stabilizes 
the bundle in the region containing the acidic residues 
E59 and D80. These countercharges are crucial for the 
activation mechanism and are involved in highly ener- 
getic electrostatic interactions; the requirement for their 
correct positioning resulted in a fine-tuning of the heli- 
cal interface that separates them. 

Concluding remarks 

The VSD is a modular molecular machine with a mech- 
anism of activation conserved over billions of years (Yu 
and Catterall, 2004) . The distribution of extant VSD se- 
quences results from an evolutionary exploration of 
sequence space constrained by the functional require- 
ments of voltage sensing. Sequence analysis of this dis- 
tribution reveals the nature of these constraints. Here, 
we applied an information-theoretical approach to de- 
scribe site-specific frequency distributions of residues. 
Then, we built a probabilistic model accounting for 
both site-specific residue propensity and pairwise resi- 
due-residue statistical couplings. In both cases, we find 
that evolutionary constraints are exerted on specific sets 
of residues or pairs of residues, suggesting a sparse net- 
work of residue-residue interactions in which evolution 
attends to only certain nodes and edges. Previous stud- 
ies have taken similar sequence analysis approaches and 
have proven successful in identifying experimentally 
validated features of membrane proteins (Stiel et al., 
2003; Lee et al., 2009; Pless et al., 2013). Here, we take 
a broader perspective and provide a comprehensive 
picture of the design principles of the VSD: (a) state- 
dependent coevolving contacts between 49-96 (and pos- 
sibly 25-96) tune the relative stability of the resting state; 
(b) absence of specific, tuned interactions along the hy- 
drophobic interfaces of S4 and neighboring SI and S3 
helices enables S4 translation; (c) networks of hydro- 
philic and hydrophobic residues lining the lumen of 
the VSD tune the access of penetrating water and gating 
charges; and (d) conserved residue positions, such as 
the countercharges 122 or F56, are surrounded by co- 
evolving networks of hydrophobic residues stabilizing 
their orientation. 



Finally, the design principles oudined in this study are 
encoded in the probabilistic model underlying DCA, the 
Potts model. This model is capable of sampling sequence 
space, emitting VSD-like sequences that by construction 
respect the pairwise correlations observed in real VSD 
sequences. This raises the exciting possibility of testing 
the validity of the design principles through the evolu- 
tionarily informed de novo design of VSDs. 
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